ABSTRACT


Item Response Theory (IRT) is a ubiquitous model for under- 
standing humans based on their responses to questions, used 
in fields as diverse as education, medicine and psychology. 
Large modern datasets offer opportunities to capture more 
nuances in human behavior, potentially improving test scor- 
ing and better informing public policy. Yet larger datasets 
pose a difficult speed / accuracy challenge to contemporary 
algorithms for fitting IRT models. We introduce a variational 
Bayesian inference algorithm for IRT, and show that it is 
fast and scalable without sacrificing accuracy. Using this
inference approach, we then extend classic IRT with expressive
Bayesian models of responses. Applying this method to five 
large-scale item response datasets from cognitive science and 
education yields higher log likelihoods and improvements in 
imputing missing data. The algorithm implementation is 
open-source, and easily usable. 


1. INTRODUCTION 


The task of estimating human ability from stochastic re- 
sponses to a series of questions has been studied since the 
1950s in thousands of papers spanning several fields. The 
standard statistical model for this problem, Item Response 
Theory (IRT), is used every day around the world, in many 
critical contexts including college admissions tests, school- 
system assessment, survey analysis, popular questionnaires, 
and medical diagnosis. 


As datasets become larger, new challenges and opportuni- 
ties for improving IRT models present themselves. On the 
one hand, massive datasets offer the opportunity to better 
understand human behavior, fitting more expressive mod- 
els. On the other hand, the algorithms that work for fitting 
small datasets often become intractable for larger data sizes. 
Indeed, despite a large body of literature, contemporary 
IRT methods fall short — it remains surprisingly difficult 
to estimate human ability from stochastic responses. One 
crucial bottleneck is that the most accurate, state-of-the-art 
Bayesian inference algorithms are prohibitively slow, while
faster algorithms (such as the popular maximum marginal
likelihood estimators) are less accurate and poorly capture 
uncertainty. This leaves practitioners with a choice: either 
have nuanced Bayesian models with appropriate inference or 
have timely computation. 


In the field of artificial intelligence, a revolution in deep gen- 
erative models via variational inference has demon- 
strated an impressive ability to perform fast inference for 
complex Bayesian models. In this paper, we present a novel 
application of variational inference to IRT, validate the re- 
sulting algorithms with synthetic datasets, and apply them 
to real world datasets. We then show that this inference 
approach allows us to extend classic IRT response models 
with deep neural network components. We find that these 
more flexible models better fit the large real world datasets. 
Specifically, our contributions are as follows: 


1. Variational inference for IRT: We derive a new 
optimization objective — the Variational Item response 
theory Lower Bound, or VIBO — to perform inference 
in IRT models. By learning a mapping from responses 
to posterior distributions over ability and items, VIBO 
is “amortized” to solve inference queries efficiently. 


2. Faster inference: We find VIBO to be much faster 
than previous Bayesian techniques and usable on much 
larger datasets without loss in accuracy. 


3. More expressive: Our inference approach is naturally 
compatible with deep generative models and, as such, 
we enable the novel extension of Bayesian IRT models 
to use neural-network-based representations for inputs, 
predictions, and student ability. We develop the first 
deep generative IRT models. 


4. Simple code: Using our VIBO Python package¹ takes only a
few lines of code, and the package is easy to extend.


5. Real world application: We demonstrate the im- 
pact of faster inference and expressive models by ap- 
plying our algorithms to datasets including: PISA, 
DuoLingo and Gradescope. We achieve up to 200 times 
speedup and show improved accuracy at imputing hid- 
den responses. At scale, these improvements in effi- 
ciency save hundreds of hours of computation. 


¹ http://github.com/mhw32/variational-item-response-theory-public

As a roadmap, in Sec. 2 we describe the item response theory
challenge. In Sec. 3 we present our main algorithm. Finally, in
Sec. 4 and 5 we show its impact on speed and accuracy.


2. BACKGROUND 


We briefly review several variations of item response theory 
and the fundamental principles of approximate Bayesian 
inference, focusing on modern variational inference. 


2.1 Item Response Theory Review 

Imagine answering a series of multiple choice questions. For 
example, consider a personality survey, a homework assign- 
ment, or a school entrance examination. Selecting a response 
to each question is an interaction between your “ability” 
(knowledge or features) and the characteristics of the ques- 
tion, such as its difficulty. The goal in examination analysis 
is to gauge this unknown ability of each student and the 
unknown item characteristics based only on responses. Early 
procedures defaulted to very simple methods, such as 
counting the number of correct responses, which ignore dif- 
ferences in question quality. In reality, we understand that 
not all questions are created equal: some may be hard to 
understand while others may test more difficult concepts. 
To capture these nuances, Item Response Theory (IRT) was 
developed as a mathematical framework to reason jointly 
about people’s ability and the items themselves. 


The IRT model plays an impactful role in many large in- 
stitutions. It is the preferred method for estimating abil- 
ity in several state assessments in the United States, for 
international assessments gauging educational competency 
across countries [18], and for the National Assessment of
Educational Progress (NAEP), a large-scale measurement
of literacy and reading comprehension in the US [35]. Beyond
education, IRT is a method widely used in cognitive science and
psychology, for instance in studies of language acquisition and
development [7].


Figure 1: Graphical models for the (a) 1PL, (b) 2PL, and (c) 
3PL Item Response Theories. Observed variables are shaded. 
Arrows represent dependency between random variables and 
each rectangle represents a plate (i.e. repeated observations). 


IRT has many forms; we review the most standard (Fig. 1).
The simplest class of IRT summarizes the ability of a person
with a single parameter. This class contains three versions:
1PL, 2PL, and 3PL IRT, which differ in the number of free
variables used to characterize an item. The 1PL IRT model,
also called the Rasch model [34], is given in Eq. 1:

$$p(r_{i,j} = 1 \mid a_i, d_j) = \frac{1}{1 + e^{-(a_i - d_j)}} \qquad (1)$$

where $r_{i,j}$ is the response by the $i$-th person to the $j$-th
item. There are $N$ people and $M$ items in total. Each
item in the 1PL model is characterized by a single number
representing difficulty, $d_j$. As the 1PL model is equivalent
to a logistic function, a higher difficulty requires a higher
ability in order to respond correctly. Next, the 2PL IRT
model² adds a discrimination parameter, $k_j$, for each item
that controls the slope (or scale) of the logistic curve. We
can expect items with higher discrimination to more quickly
separate people of low and high ability. The 3PL IRT model
further adds a pseudo-guessing parameter, $g_j$, for each item
that sets the asymptotic minimum of the logistic curve. We
can interpret pseudo-guessing as the probability of success if
the respondent were to make a reasonable guess on an item.
The 2PL and 3PL IRT models are:

$$p(r_{i,j} = 1 \mid a_i, \mathbf{d}_j) = \frac{1}{1 + e^{-(k_j a_i - d_j)}} \quad \text{or} \quad g_j + \frac{1 - g_j}{1 + e^{-(k_j a_i - d_j)}} \qquad (2)$$

where $\mathbf{d}_j = \{k_j, d_j\}$ for 2PL and $\mathbf{d}_j = \{g_j, k_j, d_j\}$ for 3PL.
See Fig. 1 for graphical models of each of these IRT models.
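As a concrete reading of the 1PL, 2PL, and 3PL response models, the following minimal sketch evaluates the probability of a correct response. The function name `irt_prob` and its argument names are illustrative choices, not part of the paper's package, and the sketch assumes that difficulty enters the logit with a negative sign, as in Eq. 1.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def irt_prob(ability: float, difficulty: float,
             discrimination: float = 1.0, guessing: float = 0.0) -> float:
    """P(correct) under 1PL (defaults), 2PL (set discrimination), or 3PL (set guessing).

    Assumption: difficulty is subtracted inside the logit, following Eq. 1.
    """
    p = sigmoid(discrimination * ability - difficulty)
    # The guessing parameter lifts the lower asymptote of the logistic curve.
    return guessing + (1.0 - guessing) * p


# 1PL: ability equal to difficulty gives an even chance of success.
print(irt_prob(ability=0.7, difficulty=0.7))
# 3PL: a very weak respondent still succeeds at roughly the guessing rate.
print(irt_prob(ability=-20.0, difficulty=0.0, guessing=0.25))
```

Setting `discrimination` above 1 steepens the curve, so the same ability gap separates respondents more sharply.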


A single ability dimension is sometimes insufficient to capture 
the relevant variation in human responses. For instance, if 
we are measuring a person’s understanding on elementary 
arithmetic, then a single dimension may suffice in capturing 
the majority of the variance. However, if we are instead 
measuring a person’s general mathematics ability, a single 
real number no longer seems sufficient. Even if we bound 
the domain to middle school mathematics, there are several 
factors that contribute to “mathematical understanding” (e.g. 
proficiency in algebra versus geometry). Summarizing a per- 
son with a single number in this setting would result in a 
fairly loose approximation. For cases where multiple facets 
of ability contribute to performance, we consider
multidimensional item response theory. We focus on 2PL
multidimensional IRT (MIRT):

$$p(r_{i,j} = 1 \mid \mathbf{a}_i, \mathbf{k}_j, d_j) = \frac{1}{1 + e^{-(\mathbf{a}_i^T \mathbf{k}_j - d_j)}} \qquad (3)$$

where we use bolded notation $\mathbf{a}_i = (a_i^{(1)}, a_i^{(2)}, \ldots, a_i^{(K)})$ to
represent a $K$-dimensional vector. Notice that the item
discrimination becomes a vector of equal size to ability.
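The multidimensional model replaces the scalar product of ability and discrimination with a dot product. A minimal sketch follows; `mirt_prob` is a hypothetical helper name, and the sign on the difficulty term is an assumption chosen to match Eq. 1.

```python
import math
from typing import Sequence


def mirt_prob(ability: Sequence[float], discrimination: Sequence[float],
              difficulty: float) -> float:
    """2PL MIRT: logistic of the dot product a_i^T k_j minus the difficulty d_j."""
    if len(ability) != len(discrimination):
        raise ValueError("ability and discrimination must share dimension K")
    logit = sum(a * k for a, k in zip(ability, discrimination)) - difficulty
    # Stable evaluation of the logistic function for either sign of the logit.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


# An item that only loads on the second ability dimension ignores the first.
print(mirt_prob(ability=[5.0, 0.0], discrimination=[0.0, 1.0], difficulty=0.0))
```

With K = 1 this reduces to the 2PL model, so the same function covers both cases.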


In practice, given a (possibly incomplete) N x M matrix 
of observed responses, we want to infer the ability of all 
N people and the characteristics of all M items. Next, we 
provide a brief overview of inference in IRT. 


2.2 Inference in Item Response Theory 

Inference is the task of estimating unknown variables, such 
as ability, given observations, such as student responses. 
We compare and contrast three popular methods used to 
perform inference for IRT in research and industry. Inference 
algorithms are critical for item response theory as slow or 
inaccurate algorithms prevent the use of appropriate models. 


² We default to 2PL as the pseudo-guessing parameter
introduces several invariances in the model. This requires far
more data to infer ability accurately, as measured by our own
synthetic experiments. For practitioners, we warn against
using 3PL for small to medium datasets.


Maximum Likelihood Estimation A straightforward
approach is to pick the most likely ability and item features
given the observed responses. To do so we optimize:

$$\mathcal{L}_{\text{MLE}} = \max_{\{a_i\}_{i=1}^N, \{\mathbf{d}_j\}_{j=1}^M} \sum_{i=1}^N \sum_{j=1}^M \log p(r_{i,j} \mid a_i, \mathbf{d}_j) \qquad (4)$$

with stochastic gradient descent (SGD). The symbol $\mathbf{d}_j$
represents all item features, e.g. $\mathbf{d}_j = \{d_j, k_j\}$ for 2PL. Eq. 4 is
often called the Joint Maximum Likelihood Estimator,
abbreviated MLE. MLE poses inference as a supervised
regression problem in which we choose the most likely unknown
variables to match known dependent variables. While MLE is
simple to understand and implement, it lacks any measure of
uncertainty; this can have important consequences, especially
when responses are missing.
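To make the joint MLE objective of Eq. 4 concrete, here is a small sketch that fits a 1PL model by gradient ascent. It is an illustration under stated assumptions rather than the authors' implementation: it uses full-batch gradient ascent instead of SGD, the simpler 1PL model instead of 2PL, and hypothetical helper names (`log_lik`, `fit_joint_mle`). Missing responses are marked with `None` and skipped.

```python
import math
import random


def log_lik(responses, ability, difficulty):
    """1PL log-likelihood over observed responses (None marks a missing entry)."""
    total = 0.0
    for i, row in enumerate(responses):
        for j, r in enumerate(row):
            if r is None:
                continue
            logit = ability[i] - difficulty[j]
            # log sigmoid(logit) if r == 1, log(1 - sigmoid(logit)) if r == 0
            total -= math.log1p(math.exp(-logit if r == 1 else logit))
    return total


def fit_joint_mle(responses, steps=200, lr=0.02):
    """Full-batch gradient ascent on Eq. 4 (an assumption; the text uses SGD)."""
    n, m = len(responses), len(responses[0])
    ability, difficulty = [0.0] * n, [0.0] * m
    for _ in range(steps):
        grad_a, grad_d = [0.0] * n, [0.0] * m
        for i, row in enumerate(responses):
            for j, r in enumerate(row):
                if r is None:
                    continue
                p = 1.0 / (1.0 + math.exp(-(ability[i] - difficulty[j])))
                grad_a[i] += r - p   # d log-lik / d a_i
                grad_d[j] -= r - p   # d log-lik / d d_j
        ability = [a + lr * g for a, g in zip(ability, grad_a)]
        difficulty = [d + lr * g for d, g in zip(difficulty, grad_d)]
    return ability, difficulty


# Toy data drawn from a 1PL model with standard Normal parameters.
rng = random.Random(0)
true_a = [rng.gauss(0, 1) for _ in range(50)]
true_d = [rng.gauss(0, 1) for _ in range(10)]
data = [[1 if rng.random() < 1.0 / (1.0 + math.exp(-(a - d))) else 0
         for d in true_d] for a in true_a]
a_hat, d_hat = fit_joint_mle(data)
print(log_lik(data, [0.0] * 50, [0.0] * 10), log_lik(data, a_hat, d_hat))
```

The fit returns only point estimates, which is the limitation noted above: nothing in `a_hat` indicates how uncertain each ability is.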


Expectation Maximization Several papers have pointed
out that when using MLE, the number of unknown parameters
increases with the number of people [5, 17]. In particular,
it has been shown that in practical settings with a finite number
of items, standard convergence theorems do not hold for
MLE as the number of people grows. To remedy this, the
authors instead treat ability as a nuisance parameter and
marginalize it out [6]. Brock et al. introduce an
Expectation-Maximization (EM) algorithm to iterate
between (1) updating beliefs about item characteristics and
(2) using the updated beliefs to define a marginal distribution
(without ability) $p(r_{i,j} \mid \mathbf{d}_j)$ by numerical integration of
$a_i$. Appropriately, this algorithm is referred to as Maximum
Marginal Likelihood Estimation, which we abbreviate as EM.
Eqs. 5 and 6 show the E and M steps for EM.

$$\text{E step:} \quad p(r_{i,j} \mid \mathbf{d}_j^{(t)}) = \int_{a_i} p(r_{i,j} \mid a_i, \mathbf{d}_j^{(t)}) \, p(a_i) \, da_i \qquad (5)$$

$$\text{M step:} \quad \mathbf{d}_j^{(t+1)} = \arg\max_{\mathbf{d}_j} \sum_{i=1}^N \log p(r_{i,j} \mid \mathbf{d}_j^{(t)}) \qquad (6)$$

where $(t)$ represents the iteration count. We often choose
$p(a_i)$ to be a simple prior distribution like standard Normal.
In general, the integral in the E-step is intractable: EM uses
Gauss-Hermite quadrature to discretely approximate
$p(r_{i,j} \mid \mathbf{d}_j^{(t)})$. A closed-form expression exists for $\mathbf{d}_j^{(t+1)}$
in the M step. This method finds the maximum a posteriori
(MAP) estimate for item characteristics. EM does not
infer ability as it is “ignored” in the model: the common 
workaround is to use EM to infer item characteristics, then 
fit ability using a second auxiliary model. In practice, EM 
has grown to be ubiquitous in industry as it is incredibly fast 
for small to medium sized datasets. However, we expect that 
EM may scale poorly to large datasets and higher dimensions 
as numerical integration requires far more points to properly 
measure a high dimensional volume. 
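The numerical integration in the E step can be sketched with Gauss-Hermite quadrature as follows. This is a minimal illustration, assuming a 2PL item, a standard Normal prior over ability, and NumPy's `hermegauss` routine for the nodes and weights; the function name `marginal_response_prob` is hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def marginal_response_prob(discrimination: float, difficulty: float,
                           num_points: int = 61) -> float:
    """Approximate p(r = 1 | d_j) = integral of p(r = 1 | a, d_j) N(a; 0, 1) da."""
    # Nodes and weights for the weight function exp(-a^2 / 2).
    nodes, weights = hermegauss(num_points)
    # Rescale so the weights sum to one, giving an expectation under N(0, 1).
    weights = weights / np.sqrt(2.0 * np.pi)
    probs = 1.0 / (1.0 + np.exp(-(discrimination * nodes - difficulty)))
    return float(np.sum(weights * probs))


# With zero difficulty the marginal is 0.5 by symmetry; harder items score lower.
print(marginal_response_prob(1.0, 0.0), marginal_response_prob(1.0, 1.0))
```

A one-dimensional grid of this size is cheap, but a K-dimensional ability would need on the order of `num_points ** K` nodes with a tensor-product rule, which is the scaling concern raised above.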


Hamiltonian Monte Carlo The two inference meth- 
ods above give only point estimates for ability and item 
characteristics. In contrast, Bayesian approaches seek to capture
the full posterior over ability and item characteristics
given observed responses, $p(a_i, \mathbf{d}_{1:M} \mid r_{i,1:M})$, where $r_{i,1:M} =
(r_{i,1}, \ldots, r_{i,M})$. Doing so provides estimates of uncertainty
and characterizes features of the joint distribution that can- 
not be represented by point estimates, such as multimodality 
and parameter correlation. In practice, this can be very 
useful for a more robust understanding of student ability. 


The common technique for Bayesian estimation in IRT uses 
Markov Chain Monte Carlo (MCMC) to draw samples
from the posterior by constructing a Markov chain carefully
designed such that $p(a_i, \mathbf{d}_{1:M} \mid r_{i,1:M})$ is the equilibrium
distribution. By running the chain longer, we can closely
match the distribution of drawn samples to the true posterior.
Hamiltonian Monte Carlo (HMC) is an
efficient version of MCMC for continuous state spaces. We 
recommend for a good review of HMC. 


The strength of this approach is that the samples gener- 
ated capture the true posterior (if the algorithm is run long 
enough). But the computational costs for MCMC can be
very high, and the cost scales at least linearly with the number
of latent parameters, which for IRT is proportional
to data size. With new datasets of millions of observations,
such limitations can be debilitating. Fortunately, there exists
a second class of approximate Bayesian techniques that have
gained significant traction in the machine learning community.
We provide a careful review of variational inference.


2.3 Variational Methods Review 


Variational inference (VI) first appeared from the statistical 
physics community and was later generalized for many
probabilistic models by Jordan et al. [24]. In recent years, VI
has been popularized in machine learning where it is used to
do inference in large graphical models describing images and
natural language. The main intuition of variational inference 
is to treat inference as an optimization problem: starting 
with a family of distributions, the goal is to pick the one 
that best approximates the true posterior, by minimizing 
an estimate of the mismatch between true and approximate 
distributions. We will first describe VI in the general context 
of a latent variable model, and then apply VI to IRT. 


Let $x \in \mathcal{X}$ and $z \in \mathcal{Z}$ represent observed and latent
variables, respectively. (In the context of IRT, $x$ represents
the responses from a single student and $z$ represents ability
and item characteristics.) In VI [4], we introduce
a family of tractable distributions $\mathcal{Q}$ over $z$ that we can
easily sample from and score. We wish to find the member
$q_{\psi^*(x)} \in \mathcal{Q}$ that minimizes the Kullback-Leibler (KL)
divergence between itself and the true posterior:

$$q_{\psi^*(x)}(z) = \arg\min_{q_{\psi(x)} \in \mathcal{Q}} D_{\text{KL}}(q_{\psi(x)}(z) \,\|\, p(z \mid x)) \qquad (7)$$

where $\psi(x)$ are parameters that define each distribution. For
example, $\psi(x)$ would be the mean and scale for a Gaussian
distribution. Since the “best” approximate posterior $q_{\psi^*(x)}$
depends on the observed variables, its parameters have $x$ as
a dependent variable. To be clear, there is one approximate
posterior for every possible value of the observed variables.
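As a toy illustration of Eq. 7, suppose the true posterior happens to be a known Gaussian and the variational family is a grid of Gaussians with a fixed scale. The sketch below picks the family member with the smallest KL divergence. The closed-form KL between univariate Gaussians is standard; the grid search and the specific numbers are assumptions made only for illustration, since in practice the posterior is unknown and the KL is minimized indirectly.

```python
import math


def kl_normal(mu_q, sigma_q, mu_p, sigma_p):
    """KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)) in closed form."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)


# Hypothetical "true" posterior p(z | x) = N(1.0, 0.5^2).
posterior = (1.0, 0.5)
# Variational family Q: Gaussians with scale 0.5 and a mean on a coarse grid.
candidates = [i / 10 for i in range(-20, 21)]
best_mean = min(candidates, key=lambda m: kl_normal(m, 0.5, *posterior))
print(best_mean, kl_normal(best_mean, 0.5, *posterior))
```

The divergence is zero only when the chosen member matches the posterior exactly, which here happens because the posterior lies inside the family.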


Frequently, we need to do inference for many different values 
of x. For example, student A and student B may have picked 
different answers to the same question. Since their responses 
differ, we would need to do inference twice. Let $p_{\mathcal{D}}(x)$ be an
empirical distribution over the observed variables, which is
equivalent to the marginal $p(x)$ if the generative model is
correctly specified. Then, the average quality of the variational
approximations is measured by

$$\mathbb{E}_{p_{\mathcal{D}}(x)} \left[ \max_{\psi(x)} \mathbb{E}_{q_{\psi(x)}(z)} \left[ \log \frac{p(x, z)}{q_{\psi(x)}(z)} \right] \right] \qquad (8)$$

In practice, $p_{\mathcal{D}}(x)$ is unknown but we assume access to a
dataset $\mathcal{D}$ of examples sampled i.i.d. from $p_{\mathcal{D}}(x)$; this is
sufficient to evaluate Eq. 8.


Amortization As in Eq. 8, we must learn an approximate
posterior for each $x \in \mathcal{D}$. For a large dataset $\mathcal{D}$, this can
quickly grow to be unwieldy. One solution to this
scalability problem is amortization [16], which reframes the
per-observation optimization problem as a supervised regression
task. Consider learning a single deterministic mapping
$f_\phi : \mathcal{X} \to \mathcal{Q}$ to predict $\psi^*(x)$ or equivalently $q_{\psi^*(x)} \in \mathcal{Q}$ as
a function of the observation $x$. Often, we choose $f_\phi$ to be a
conditional distribution, denoted by $q_\phi(z \mid x) = f_\phi(x)(z)$.


The benefit of amortization is a large reduction in compu- 
tational cost: the number of parameters is vastly smaller 
than learning a per-observation posterior. Additionally, if 
we manage to learn a good regressor, then the amortized 
approximate posterior $q_\phi(z \mid x)$ could generalize to new
observations $x \notin \mathcal{D}$ unseen in training. This strength has made
amortized VI popular with modern latent variable models, 
such as the Variational Autoencoder [25]. 


Instead of Eq. 8, we now optimize:

$$\max_\phi \mathbb{E}_{p_{\mathcal{D}}(x)} \left[ \mathbb{E}_{q_\phi(z \mid x)} \left[ \log \frac{p(x, z)}{q_\phi(z \mid x)} \right] \right] \qquad (9)$$

The drawback of this approach is that it introduces an
amortization gap: since we are technically using a less flexible
family of approximate distributions, the quality of approximate
posteriors can be inferior.


Model Learning So far we have assumed a fixed generative
model $p(x, z)$. However, often we can only specify a
family of possible models $p_\theta(x \mid z)$ parameterized by $\theta$. The
symmetric challenge (to approximate inference) is to choose
$\theta$ whose model best explains the evidence. Naturally, we do
so by maximizing the log marginal likelihood of the data

$$\log p_\theta(x) = \log \int_z p_\theta(x, z) \, dz \qquad (10)$$

Using Eq. 9, we derive the Evidence Lower Bound (ELBO)
with $q_\phi(z \mid x)$ as our inference model

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)} \left[ \log \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \right] \triangleq \text{ELBO} \qquad (11)$$

We can jointly optimize $\phi$ and $\theta$ to maximize the ELBO.
We have the option to parameterize $p_\theta(x \mid z)$ and $q_\phi(z \mid x)$
with deep neural networks, as is common with the VAE [25],
yielding an extremely flexible space of distributions.
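The bound in Eq. 11 can be checked numerically on a toy model whose evidence is known in closed form. The sketch below assumes a conjugate Gaussian model, z ~ N(0, 1) and x | z ~ N(z, 1), so that x ~ N(0, 2) and the exact posterior is N(x/2, 1/2). This model and the helper names are illustrative assumptions and are unrelated to IRT.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def elbo_estimate(x, q_mean, q_var, num_samples=5000, seed=0):
    """Monte Carlo ELBO for the toy model z ~ N(0, 1), x | z ~ N(z, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(q_mean, math.sqrt(q_var))          # z ~ q(z | x)
        log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
        total += log_joint - log_normal_pdf(z, q_mean, q_var)
    return total / num_samples


x = 2.0
log_evidence = log_normal_pdf(x, 0.0, 2.0)               # exact log p(x)
tight = elbo_estimate(x, q_mean=x / 2.0, q_var=0.5)      # q is the exact posterior
loose = elbo_estimate(x, q_mean=0.0, q_var=1.0)          # q is the prior
print(log_evidence, tight, loose)
```

When q equals the exact posterior the bound is tight, and a poorer q leaves a gap equal to the KL divergence between q and the posterior.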


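Stochastic Gradient Estimation The gradients of the
ELBO (Eq. 11) with respect to $\theta$ and $\phi$ are:

$$\nabla_\theta \text{ELBO} = \mathbb{E}_{q_\phi(z \mid x)} \left[ \nabla_\theta \log p_\theta(x, z) \right] \qquad (12)$$

$$\nabla_\phi \text{ELBO} = \nabla_\phi \mathbb{E}_{q_\phi(z \mid x)} \left[ \log \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \right] \qquad (13)$$

Eq. 12 can be estimated using Monte Carlo samples. However,
as it stands, Eq. 13 is difficult to estimate as we cannot
distribute the gradient inside the expectation. For
certain families $\mathcal{Q}$, we can use a reparameterization trick.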


Reparameterization Estimators Reparameterization
is the technique of removing sampling from the gradient
computation graph [37]. In particular, if we can reduce
sampling $z \sim q_\phi(z \mid x)$ to sampling from a parameter-free
distribution $\epsilon \sim p(\epsilon)$ plus a deterministic function application,
$z = g_\phi(\epsilon)$, then we may rewrite Eq. 13 as:

$$\nabla_\phi \text{ELBO} = \mathbb{E}_{p(\epsilon)} \left[ \nabla_z \log \frac{p_\theta(x, z(\epsilon))}{q_\phi(z(\epsilon) \mid x)} \nabla_\phi g_\phi(\epsilon) \right] \qquad (14)$$

which now can be estimated efficiently by Monte Carlo (the
gradient is inside the expectation). A benefit of reparameterization
over alternative estimators (e.g. the score estimator
or REINFORCE [44]) is lower variance while remaining unbiased.
A common example is if $q_\phi(z \mid x)$ is Gaussian $\mathcal{N}(\mu, \sigma^2)$
and we choose $p(\epsilon)$ to be $\mathcal{N}(0, 1)$, then $g_\phi(\epsilon) = \epsilon \cdot \sigma + \mu$.
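The Gaussian example can be made concrete with a small sketch. It estimates the gradient of an expectation with respect to the mean and scale of a Gaussian by pushing the gradient through z = mu + sigma * eps. The test function f(z) = z^2 is an assumption chosen because the exact answer is known: E[z^2] = mu^2 + sigma^2, so the gradients are 2 * mu and 2 * sigma.

```python
import random


def reparam_gradients(mu, sigma, num_samples=20000, seed=0):
    """Estimate d/dmu and d/dsigma of E_{z ~ N(mu, sigma^2)}[z^2] by reparameterization."""
    rng = random.Random(seed)
    grad_mu = grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)      # parameter-free noise, eps ~ N(0, 1)
        z = mu + sigma * eps           # deterministic transform g(eps)
        df_dz = 2.0 * z                # gradient of f(z) = z^2 at the sample
        grad_mu += df_dz               # chain rule with dz/dmu = 1
        grad_sigma += df_dz * eps      # chain rule with dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples


g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.5)
print(g_mu, g_sigma)   # analytic values are 2 * mu = 3.0 and 2 * sigma = 1.0
```

Because the randomness lives in `eps`, the gradient is computed sample by sample inside the average, which is what Eq. 14 expresses.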


3. THE VIBO ALGORITHM 

Having rehearsed the major principles of VI, we will adapt 
them to IRT. In our review, we presented the ELBO that 
serves as the primary loss function to train an inference 
model. Given the nuances of IRT, we can derive a new loss 
function specialized for ability and item characteristics. We 
call the resulting algorithm VIBO since it is a Variational 
approach for Item response theory based on a novel lower 
BOund. While the remainder of the section presents the 
technical details, we ask the reader to keep the high-level 
purpose in mind: VIBO is an objective function that, when
maximized, gives us a method to predict student ability from
his or her responses. As an optimization problem, VIBO is
much cheaper computationally than MCMC.


To show that doing so is justifiable, we prove that VIBO is
well-defined. That is, we must show that VIBO lower bounds
the marginal likelihood over a student’s responses. 


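THEOREM 3.1. Let $a_i$ be the ability for person $i \in [1, N]$
and $\mathbf{d}_j$ be the characteristics for item $j \in [1, M]$. We use
the shorthand notation $\mathbf{d}_{1:M} = (\mathbf{d}_1, \ldots, \mathbf{d}_M)$. Let $r_{i,j}$ be the
binary response for item $j$ by person $i$. We write $r_{i,1:M} =
(r_{i,1}, \ldots, r_{i,M})$. If we define the VIBO objective as:

$$\text{VIBO} \triangleq \mathcal{L}_{\text{recon}} - \mathbb{E}_{q_\phi(\mathbf{d}_{1:M} \mid r_{i,1:M})}[D_{\text{ability}}] - D_{\text{item}}$$

where

$$\mathcal{L}_{\text{recon}} = \mathbb{E}_{q_\phi(a_i, \mathbf{d}_{1:M} \mid r_{i,1:M})} \left[ \log p_\theta(r_{i,1:M} \mid a_i, \mathbf{d}_{1:M}) \right]$$

$$D_{\text{ability}} = D_{\text{KL}}(q_\phi(a_i \mid \mathbf{d}_{1:M}, r_{i,1:M}) \,\|\, p(a_i))$$

$$D_{\text{item}} = D_{\text{KL}}(q_\phi(\mathbf{d}_{1:M} \mid r_{i,1:M}) \,\|\, p(\mathbf{d}_{1:M}))$$

and assume the joint posterior factors as $q_\phi(a_i, \mathbf{d}_{1:M} \mid r_{i,1:M}) =
q_\phi(a_i \mid \mathbf{d}_{1:M}, r_{i,1:M}) \, q_\phi(\mathbf{d}_{1:M} \mid r_{i,1:M})$, then $\log p_\theta(r_{i,1:M}) \geq \text{VIBO}$.
In other words, VIBO is a lower bound on the log marginal
probability of person $i$’s responses.


PROOF. Expand the marginal and apply Jensen’s inequality:

$$\begin{aligned}
\log p_\theta(r_{i,1:M}) &\geq \mathbb{E}_{q_\phi(a_i, \mathbf{d}_{1:M} \mid r_{i,1:M})} \left[ \log \frac{p_\theta(r_{i,1:M}, a_i, \mathbf{d}_{1:M})}{q_\phi(a_i, \mathbf{d}_{1:M} \mid r_{i,1:M})} \right] \\
&= \mathbb{E}_{q_\phi(a_i, \mathbf{d}_{1:M} \mid r_{i,1:M})} \left[ \log p_\theta(r_{i,1:M} \mid a_i, \mathbf{d}_{1:M}) \right] \\
&\quad + \mathbb{E}_{q_\phi(a_i, \mathbf{d}_{1:M} \mid r_{i,1:M})} \left[ \log \frac{p(a_i)}{q_\phi(a_i \mid \mathbf{d}_{1:M}, r_{i,1:M})} \right] \\
&\quad + \mathbb{E}_{q_\phi(a_i, \mathbf{d}_{1:M} \mid r_{i,1:M})} \left[ \log \frac{p(\mathbf{d}_{1:M})}{q_\phi(\mathbf{d}_{1:M} \mid r_{i,1:M})} \right] \\
&= \mathcal{L}_{\text{recon}} + \mathcal{L}_A + \mathcal{L}_B
\end{aligned}$$

Rearranging the latter two terms, we find that:

$$\mathcal{L}_A = -\mathbb{E}_{q_\phi(\mathbf{d}_{1:M} \mid r_{i,1:M})} \left[ D_{\text{KL}}(q_\phi(a_i \mid \mathbf{d}_{1:M}, r_{i,1:M}) \,\|\, p(a_i)) \right]$$

$$\mathcal{L}_B = \mathbb{E}_{q_\phi(\mathbf{d}_{1:M} \mid r_{i,1:M})} \left[ \log \frac{p(\mathbf{d}_{1:M})}{q_\phi(\mathbf{d}_{1:M} \mid r_{i,1:M})} \right] = -D_{\text{KL}}(q_\phi(\mathbf{d}_{1:M} \mid r_{i,1:M}) \,\|\, p(\mathbf{d}_{1:M}))$$

Since $\text{VIBO} = \mathcal{L}_{\text{recon}} + \mathcal{L}_A + \mathcal{L}_B$, the inequality above
shows that VIBO lower bounds $\log p_\theta(r_{i,1:M})$.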


Thm. 3.1 leaves several choices up to us, and we opt for the
simplest ones. For instance, the prior distributions are chosen
to be independent standard Normal distributions: $p(\mathbf{a}_i) =
\prod_{k=1}^K p(a_{i,k})$ and $p(\mathbf{d}_{1:M}) = \prod_{j=1}^M p(\mathbf{d}_j)$ where $p(a_{i,k})$ and
$p(\mathbf{d}_j)$ are $\mathcal{N}(0, 1)$. Further, we found it sufficient to assume
$q_\phi(\mathbf{d}_{1:M} \mid r_{i,1:M}) = q_\phi(\mathbf{d}_{1:M}) = \prod_{j=1}^M q_\phi(\mathbf{d}_j)$ although nothing
prevents the general case. Initially, we assume the generative
model, $p_\theta(r_{i,1:M} \mid a_i, \mathbf{d}_{1:M})$, to be an IRT model (thus $\theta$ is
empty); later we explore generalizations.


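Algorithm 1: VIBO Forward Pass


Assume we are given observed responses for person $i$, $r_{i,1:M}$;
Compute $\mu_d, \sigma_d^2 = q_\phi(\mathbf{d}_{1:M})$;

Sample $\mathbf{d}_{1:M} \sim \mathcal{N}(\mu_d, \sigma_d^2)$;

Compute $\mu_a, \sigma_a^2 = q_\phi(a_i \mid \mathbf{d}_{1:M}, r_{i,1:M})$;

Sample $a_i \sim \mathcal{N}(\mu_a, \sigma_a^2)$;

Compute $\mathcal{L}_{\text{recon}} = \log p_\theta(r_{i,1:M} \mid a_i, \mathbf{d}_{1:M})$;

Compute $D_{\text{ability}} = D_{\text{KL}}(\mathcal{N}(\mu_a, \sigma_a^2) \,\|\, \mathcal{N}(0, 1))$;

Compute $D_{\text{item}} = D_{\text{KL}}(\mathcal{N}(\mu_d, \sigma_d^2) \,\|\, \mathcal{N}(0, 1))$;

Compute $\text{VIBO} = \mathcal{L}_{\text{recon}} - D_{\text{ability}} - D_{\text{item}}$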
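The forward pass in Alg. 1 can be sketched in NumPy for a single person and a one-dimensional 2PL model. This is an illustration under several assumptions and not the released package: the inference model for ability is a hypothetical linear map `W` from (k_j, d_j, r_ij, 1) to a mean and log-variance, missing responses are encoded as NaN, and the two KL terms are subtracted from the reconstruction term, as a lower bound on the log marginal requires.

```python
import numpy as np


def kl_to_standard_normal(mu, logvar):
    """Sum of KL(N(mu, exp(logvar)) || N(0, 1)) over all entries."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)


def vibo_forward(responses, item_mu, item_logvar, W, rng):
    """One-sample VIBO estimate for a single person.

    responses: shape (M,), entries 0/1, or np.nan when missing.
    item_mu, item_logvar: shape (M, 2), variational parameters for (k_j, d_j).
    W: shape (4, 2), hypothetical linear inference map giving (mu, logvar).
    """
    observed = ~np.isnan(responses)
    # Sample item characteristics from q(d_{1:M}) by reparameterization.
    items = item_mu + np.exp(0.5 * item_logvar) * rng.standard_normal(item_mu.shape)
    k, d = items[:, 0], items[:, 1]
    # Per-item Gaussian experts q(a_i | d_j, r_ij); missing items use the prior.
    r_filled = np.where(observed, responses, 0.0)
    feats = np.stack([k, d, r_filled, np.ones_like(k)], axis=1)
    expert = feats @ W
    exp_mu = np.where(observed, expert[:, 0], 0.0)
    exp_logvar = np.where(observed, expert[:, 1], 0.0)
    # Product of Gaussian experts: precisions add, means are precision-weighted.
    precision = np.exp(-exp_logvar)
    a_var = 1.0 / precision.sum()
    a_mu = a_var * np.sum(precision * exp_mu)
    a = a_mu + np.sqrt(a_var) * rng.standard_normal()
    # Reconstruction term: 2PL Bernoulli log-likelihood of observed responses.
    logits = k * a - d
    log_p = -np.logaddexp(0.0, -logits)     # log sigmoid(logits)
    log_1mp = -np.logaddexp(0.0, logits)    # log (1 - sigmoid(logits))
    recon = np.sum(np.where(observed, r_filled * log_p + (1 - r_filled) * log_1mp, 0.0))
    d_ability = kl_to_standard_normal(a_mu, np.log(a_var))
    d_item = kl_to_standard_normal(item_mu, item_logvar)
    return recon - d_ability - d_item, recon, d_ability, d_item


rng = np.random.default_rng(0)
M = 5
responses = np.array([1.0, 0.0, np.nan, 1.0, 1.0])
vibo, recon, d_ability, d_item = vibo_forward(
    responses, np.zeros((M, 2)), np.zeros((M, 2)),
    0.1 * rng.standard_normal((4, 2)), rng)
print(vibo, recon, d_ability, d_item)
```

In training, the item parameters and `W` would be updated by gradient ascent on this quantity using an automatic differentiation library; that step is omitted here.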


The posterior $q_\phi(a_i \mid \mathbf{d}_{1:M}, r_{i,1:M})$ needs to be robust to missing
data as often not every person answers every question.
To achieve this, we explore the following family:

$$q_\phi(a_i \mid \mathbf{d}_{1:M}, r_{i,1:M}) = \prod_{j=1}^M q_\phi(a_i \mid \mathbf{d}_j, r_{i,j}) \qquad (15)$$

If we assume each component $q_\phi(a_i \mid \mathbf{d}_j, r_{i,j})$ is Gaussian, then
$q_\phi(a_i \mid \mathbf{d}_{1:M}, r_{i,1:M})$ is Gaussian as well, being a Product-of-Experts
[45]. If item $j$ is missing, we replace its term
in the product with the prior, $p(a_i)$, representing no added
information. We found this design to outperform averaging
over non-missing entries: $\frac{1}{M} \sum_{j=1}^M q_\phi(a_i \mid \mathbf{d}_j, r_{i,j})$.
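For Gaussian components the product in Eq. 15 has a simple closed form: precisions add and the mean is the precision-weighted average of the component means. The sketch below shows this, with a missing item replaced by the prior as described. The function name and the (mean, variance) tuple encoding are illustrative assumptions.

```python
from typing import Optional, Sequence, Tuple

Gaussian = Tuple[float, float]  # (mean, variance)


def product_of_experts(experts: Sequence[Optional[Gaussian]],
                       prior: Gaussian = (0.0, 1.0)) -> Gaussian:
    """Combine Gaussian experts; a missing expert (None) is replaced by the prior."""
    precision_sum = 0.0
    weighted_mean_sum = 0.0
    for expert in experts:
        mean, var = prior if expert is None else expert
        precision_sum += 1.0 / var
        weighted_mean_sum += mean / var
    var = 1.0 / precision_sum
    return weighted_mean_sum * var, var


# Two observed items that each suggest ability near 2, plus one missing item.
mean, var = product_of_experts([(2.0, 1.0), None, (2.0, 1.0)])
print(mean, var)
```

Each additional expert can only increase the total precision, so more answered items yield a narrower posterior over ability.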


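As VIBO is a close cousin of the ELBO, we can estimate its
gradients with respect to $\theta$ and $\phi$ similarly:

$$\nabla_\theta \text{VIBO} = \nabla_\theta \mathcal{L}_{\text{recon}} = \mathbb{E}_{q_\phi(a_i, \mathbf{d}_{1:M} \mid r_{i,1:M})} \left[ \nabla_\theta \log p_\theta(r_{i,1:M} \mid a_i, \mathbf{d}_{1:M}) \right]$$

$$\begin{aligned}
\nabla_\phi \text{VIBO} &= \nabla_\phi \mathcal{L}_{\text{recon}} - \nabla_\phi \mathbb{E}_{q_\phi(\mathbf{d}_{1:M} \mid r_{i,1:M})}[D_{\text{ability}}] - \nabla_\phi D_{\text{item}} \\
&= \nabla_\phi \mathbb{E}_{q_\phi(a_i, \mathbf{d}_{1:M} \mid r_{i,1:M})} \left[ \log \frac{p_\theta(r_{i,1:M} \mid a_i, \mathbf{d}_{1:M}) \, p(a_i) \, p(\mathbf{d}_{1:M})}{q_\phi(a_i, \mathbf{d}_{1:M} \mid r_{i,1:M})} \right]
\end{aligned}$$

As in Eq. 14, we may wish to move the gradient inside the
KL divergences by reparameterization to reduce variance.
To allow easy reparameterization, we define all variational
distributions $q_\phi(\cdot \mid \cdot)$ as Normal distributions with diagonal
covariance. In practice, we find that estimating $\nabla_\theta \text{VIBO}$ and
$\nabla_\phi \text{VIBO}$ with a single sample is sufficient. With this setup,
VIBO can be optimized using stochastic gradient descent
to learn an amortized inference model that maximizes the
marginal probability of observed data. We summarize the
required computation to calculate VIBO in Alg. 1.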


4. DATASETS 


We explore one synthetic dataset, to build intuition and 
confirm parameter recovery, and five large scale applications 
of IRT to real world data, summarized in Table 1.


Table 1: Dataset Statistics


              # PERSONS   # ITEMS   MISSING DATA?
CRITLANGACQ      669498        95   N
WORDBANK           5520       797   N
DUOLINGO           2587      2125   Y
GRADESCOPE         1254        98   Y
PISA             519334       183   Y

Synthetic IRT To sanity check that VIBO performs as 
well as other inference techniques, we synthetically generate 
a dataset of responses using a 2PL IRT model: sample $a_i \sim
p(a_i)$, $\mathbf{d}_j \sim p(\mathbf{d}_j)$. Given ability and item characteristics,
IRT-2PL determines a Bernoulli distribution over responses
to item $j$ by person $i$. We sample once from this Bernoulli
distribution to “generate” an observation. In this setting, we 
know the ground truth ability and item characteristics. We 
vary N and M to explore parameter recovery. 
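The synthetic generation procedure can be sketched in a few lines. The sketch assumes standard Normal priors for ability, discrimination, and difficulty, and the same sign convention for difficulty as Eq. 1; `generate_2pl` is a hypothetical helper name, not a function from the released code.

```python
import math
import random


def generate_2pl(num_people, num_items, seed=0):
    """Sample abilities, 2PL item parameters, and one Bernoulli response per pair."""
    rng = random.Random(seed)
    ability = [rng.gauss(0.0, 1.0) for _ in range(num_people)]
    # Each item is a (discrimination k_j, difficulty d_j) pair.
    items = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(num_items)]
    responses = []
    for a in ability:
        row = []
        for k, d in items:
            p = 1.0 / (1.0 + math.exp(-(k * a - d)))
            row.append(1 if rng.random() < p else 0)   # single Bernoulli draw
        responses.append(row)
    return ability, items, responses


ability, items, responses = generate_2pl(num_people=20, num_items=8, seed=1)
print(len(responses), len(responses[0]), sum(map(sum, responses)))
```

Because the ground-truth `ability` and `items` are returned alongside the responses, inferred parameters can be compared against them directly.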


Second Language Acquisition This dataset contains 
native and non-native English speakers answering questions 
on a grammar quiz³, which upon completion would return a
prediction of the user’s native language. Using social media, 
over half a million users of varying ages and demographics 
completed the quiz. Quiz questions often contain both visual 
and linguistic components. For instance, a quiz question 
could ask the user to “choose the image where the dog is 
chased by the cat” and present two images of animals where 
only one image agrees with the caption. Every response is
thus binary, marked as correct or incorrect. In total, there 
are 669,498 people with 95 items and no missing data. The 
creators of this dataset use it to study the presence or absence 
of a “critical period” for second language acquisition [19]. We
will refer to this dataset as CRITLANGACQ. 


WordBank: Vocabulary Development The MacArthur- 
Bates Communicative Development Inventories (CDIs) are 
a widely used metric for early language acquisition in chil- 
dren, testing concepts in vocabulary comprehension, produc- 
tion, gestures, and grammar. The WordBank database 
archives many independently collected CDI datasets across 
languages and research laboratories⁴. The database consists
of a matrix of people against vocabulary words where the
$(i, j)$ entry is 1 if a parent reports that child $i$ has knowledge
of word j and 0 otherwise. Some entries are missing due 
to slight variations in surveys and incomplete responses. In 
total, there are 5,520 children responding to 797 items. 


³ The quiz is available online; the data is publicly available
at osf.io/pyb8s
⁴ github.com/langcog/wordbankr


DuoLingo: App-Based Language Learning We examine
the 2018 DuoLingo Shared Task on Second Language
Acquisition Modeling⁵ [38]. This dataset contains
anonymized user data from the popular education applica- 
tion, DuoLingo. In the application, users must choose the 
correct vocabulary word among a list of distractor words. 
We focus on the subset of native English speakers learning 
Spanish and only consider lesson sessions. Each user has a 
timeseries of responses to a list of vocabulary words, each of 
which is shown several times. We repurpose this dataset for 
IRT: the goal being to infer the user’s language proficiency 
from his or her errors. As such, we average over all times 
a user has seen each vocabulary item. For example, if the 
user was presented “habla” 10 times and correctly identified 
the word 5 times, he or she would be given a response score 
of 0.5. We then round to 0 or 1. We revisit a continuous 
version in Sec. 7. After processing, we have 2587 users and 2125
vocabulary words with missing data as users frequently drop 
out. We ignore user and syntax features. 


Gradescope: Course Exam Data Gradescope is
a course application that assists teachers in grading student
assignments. This dataset contains 105,218 responses from
6,607 assignments in 2,748 courses and 139 schools. All 
assignments are instructor-uploaded, fixed-template assign- 
ments, with at least 3 questions, with the majority being 
examinations. We focus on course 102576, randomly chosen. 
We remove students who did not respond to any questions 
and round up partial credit. In total, there are 1254 students 
with 98 items, with missing entries. 


PISA 2015: International Assessment The Programme 
for International Student Assessment (PISA) is an interna- 
tional exam that measures 15-year-old students’ reading, 
mathematics, and science literacy every three years. It is 
run by the Organization for Economic Cooperation and De- 
velopment (OECD). The OECD released anonymized data 
from PISA ’15 for students from 80 countries and education 
systems⁶. We focus on the science component. Using IRT to
assess student performance is part of the pipeline the OECD
uses to compute holistic literacy scores for different countries. 
As part of our processing, we binarize responses, rounding 
any partial credit to 1. In total, there are 519,334 students 
and 183 questions. Not every student answers every question 
as many versions of the computer exam exist. 


5. FAST AND ACCURATE INFERENCE 

We will show that VIBO is as accurate as HMC and nearly 
as fast as MLE/EM, making Bayesian IRT a realistic, even 
preferred, option for modern applications. 


5.1 Evaluation 

We compare the compute cost of VIBO to HMC, EM⁷, and MLE
using IRT-2PL by measuring wall-clock run time. For HMC, 
we limit to drawing 200 samples with 100 warmup steps 
with no parallelization. For VIBO and MLE, we use the 
Adam optimizer with a learning rate of 5e-3. We choose to 
conservatively optimize for 10k iterations to estimate cost. 


⁵ sharedtask.duolingo.com/2018.html
⁶ oecd.org/pisa/data/2015database
⁷ We use the popular MIRT package in R for EM with 61
points for numerical integration.


Figure 2: Performance of inference algorithms for IRT for
synthetic data, as we vary the number of people, items, and 
latent ability dimensions. (Top) Computational cost in log- 
seconds (e.g. 1 log second is about 3 seconds whereas 10 log 
seconds is 6.1 hours). (Middle) Correlation of inferred ability 
with true ability (used to generate the data). (Bottom) 
Accuracy of held-out data imputation. 


However, speed only matters assuming good performance. 
We use three metrics of accuracy: (1) For the synthetic 
dataset, because we know the true ability, we can measure 
the expected correlation between it and the inferred ability 
under each algorithm (with the exception of EM as ability 
is not inferred). A correlation of 1.0 would indicate perfect 
inference. (2) The most general metric is the accuracy of 
imputed missing data. We hold out 10% of the responses, 
use the inferred ability and item characteristics to generate 
responses thereby populating missing entries, and compute 
prediction accuracy for held-out responses. This metric is 
a good test of “overfitting” to observed responses. (3) In 
the case of fully Bayesian methods (HMC and VIBO) we 
can compare posterior predictive statistics to further 
test uncertainty calibration (which accuracy alone does not 
capture). Recall that the posterior predictive is defined as: 


$$p(\tilde{r}_{i,1:M} \mid r_{i,1:M}) = \mathbb{E}_{p(a_i, d_{1:M} \mid r_{i,1:M})}\left[ p(\tilde{r}_{i,1:M} \mid a_i, d_{1:M}) \right]$$


For HMC, we have samples of ability and item characteristics
from the true posterior, whereas for VIBO, we draw samples
from $q_\phi(a_i, d_{1:M} \mid r_{i,1:M})$. Given such parameter
samples, we can then sample responses. We compare summary
statistics of these response samples: the average number of items
answered correctly per person and the average number of
people who answered each item correctly.
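The two summary statistics can be sketched as follows for a 2PL model. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the "posterior samples" are toy draws around fixed values standing in for samples from HMC or from the variational posterior.

```python
import math
import random

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sample_responses(abilities, items, rng):
    """Draw one binary response matrix from a 2PL model.

    abilities: list of a_i; items: list of (k_j, d_j) pairs.
    """
    return [[1 if rng.random() < sigmoid(k * a + d) else 0 for (k, d) in items]
            for a in abilities]

def summary_statistics(parameter_samples, rng):
    """Average per-person and per-item correct counts over parameter samples."""
    n_people = len(parameter_samples[0][0])
    n_items = len(parameter_samples[0][1])
    per_person = [0.0] * n_people   # items answered correctly by each person
    per_item = [0.0] * n_items      # people answering each item correctly
    for abilities, items in parameter_samples:
        responses = sample_responses(abilities, items, rng)
        for i, row in enumerate(responses):
            per_person[i] += sum(row)
            for j, r in enumerate(row):
                per_item[j] += r
    s = len(parameter_samples)
    return [v / s for v in per_person], [v / s for v in per_item]

rng = random.Random(0)
# Toy stand-in for posterior samples: 50 draws around fixed parameter values.
samples = [([rng.gauss(m, 0.1) for m in (-1.0, 0.0, 1.0)],
            [(1.0, rng.gauss(d, 0.1)) for d in (-0.5, 0.5)])
           for _ in range(50)]
per_person, per_item = summary_statistics(samples, rng)
```

Computing these statistics once with HMC samples and once with VIBO samples, then correlating the two, gives the comparison described above.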


5.2 Synthetic Data Results 


With synthetic experiments we are free to vary N and M
to extremes to stress test the inference algorithms: first, we
range from 100 to 1.5 million people, fixing the number of
items to 100 with dimensionality 1; second, we range from
10 to 6k items, fixing 10k people with dimensionality 1; third,
we vary the number of latent ability dimensions from 1 to 5,
keeping a constant 10k people and 100 items.
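A synthetic dataset of this kind, together with the 10% hold-out used for imputation accuracy, could be generated along the following lines. The priors (standard Normal ability and offset, log-Normal discrimination) and the function names are assumptions made for illustration; this section does not state the priors actually used.

```python
import math
import random

def generate_2pl(n_people, n_items, rng):
    """Sample abilities, item parameters, and binary responses from a 2PL model."""
    abilities = [rng.gauss(0.0, 1.0) for _ in range(n_people)]
    # Assumed priors: discrimination k_j ~ LogNormal(0, 0.25), offset d_j ~ N(0, 1).
    items = [(rng.lognormvariate(0.0, 0.25), rng.gauss(0.0, 1.0))
             for _ in range(n_items)]
    responses = [[int(rng.random() < 1.0 / (1.0 + math.exp(-k * a - d)))
                  for (k, d) in items] for a in abilities]
    return abilities, items, responses

def holdout_mask(n_people, n_items, fraction, rng):
    """Mark roughly `fraction` of the entries as held out (True = hidden)."""
    return [[rng.random() < fraction for _ in range(n_items)]
            for _ in range(n_people)]

rng = random.Random(0)
abilities, items, responses = generate_2pl(n_people=1000, n_items=100, rng=rng)
mask = holdout_mask(1000, 100, fraction=0.10, rng=rng)
held_out_share = sum(map(sum, mask)) / (1000 * 100)
```

Because the true abilities are retained, the correlation between inferred and true ability can be computed directly on such data.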
(a) Real World: Accuracy of Imputing Missing Data vs Time Cost
(b) Real World: Posterior Predictive Checks


Figure 3: (a) Accuracy of missing data imputation for real world datasets plotted against time saved in seconds compared to
using HMC. (b) Sample statistics from the posterior predictive defined using HMC and VIBO. A correlation of 1.0 would be
perfect alignment between the two inference techniques. Subfigures in red show the average number of items answered correctly for
each person. Subfigures in yellow show the average number of people who answered each item correctly.


Fig. 2 shows run-time and performance results for VIBO,
MLE, HMC, EM, and two ablations of VIBO (discussed in
Sec. 5.4). First, comparing parameter recovery performance
(Fig. 2 middle), we see that HMC, MLE and VIBO all recover
parameters well. The only notable exceptions are VIBO
with very few people, and HMC and (to a lesser extent) VIBO
in high dimensions. The former is because the amortized
posterior approximation requires a sufficiently large dataset
(around 500 people) to constrain its parameters. The latter
is a simple effect of the scaling of variance for sample-based
estimates as dimensionality increases (we fixed the number
of samples used, to ease speed comparisons).


Turning to the ability to predict missing data (Fig. 2 bottom),
we see that VIBO performs as well as HMC, except in
the case of very few people (again, discussed below). (Note
that the latent dimensionality does not adversely affect VIBO
or HMC for missing data prediction, because the variance
is marginalized away.) MLE also performs well as we scale
the number of items and latent ability dimensions, but is less
able to benefit from more people. EM, on the other hand,
provides much worse missing data prediction in all cases.


Finally, if we examine the speed of inference (Fig. 2 top),
VIBO is only slightly slower than MLE, both of which are
orders of magnitude faster than HMC. For instance, with
1.56 million people, HMC takes 217 hours whereas VIBO 
takes 800 seconds. Similarly with 6250 items, HMC takes 4.3 
hours whereas VIBO takes 385 seconds. EM is the fastest 
for low to medium sized datasets, though its lower accuracy 
makes this a dubious victory. Furthermore, EM does not 
scale as well as VIBO to large datasets. 
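The speedups implied by the run times quoted above can be worked out directly. The snippet is plain arithmetic on the reported figures; the helper name is hypothetical.

```python
def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """Ratio of the slower run time to the faster one."""
    return slow_seconds / fast_seconds

HOUR = 3600.0
people_speedup = speedup(217 * HOUR, 800)   # HMC vs VIBO, 1.56 million people
items_speedup = speedup(4.3 * HOUR, 385)    # HMC vs VIBO, 6250 items
print(f"{people_speedup:.0f}x, {items_speedup:.0f}x")
```

That is, close to three orders of magnitude when scaling people and about 40 times when scaling items.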


5.3 Real World Data Results


We next apply VIBO to real world datasets in cognitive
science and education. Fig. 3(a) plots the accuracy of imputing
missing data against the time saved vs HMC (the most
expensive inference algorithm) for five large-scale datasets.
Points in the upper right corner are more desirable as they
are more accurate and faster. The dotted line represents 100
hours saved compared to HMC.




From Fig. 3(a), we find many of the same patterns as we
observed in the synthetic experiments. Running HMC on
CritLangAcq or PISA takes roughly 120 hours whereas VIBO
takes 50 minutes for CritLangAcq and 5 hours for PISA,
the latter being more expensive because of the computation
required for missing data. In comparison, EM is at times
faster than VIBO (e.g. Gradescope, PISA) and at times
slower. With respect to accuracy, VIBO and HMC are
again identical, outperforming EM by up to 8% in missing
data imputation. Interestingly, we find the “overfitting” of
MLE to be more pronounced here. If we focus on DuoLingo
and Gradescope, the two datasets with large pre-existing
portions of missing values, MLE is surpassed by EM, with
VIBO achieving accuracies 10% higher.


Another way of exploring a model’s ability to explain data, for
fully Bayesian models, is posterior predictive checks. Fig. 3(b)
shows posterior predictive checks comparing VIBO and HMC.
We find that the two algorithms strongly agree, in all datasets,
about the average number of items answered correctly per person
and the average number of people answering each item correctly.
The only systematic deviations occur with DuoLingo: it is
possible that this is a case where a more expressive posterior
approximation would be useful in VIBO, since the number
of items is greater than the number of people.


5.4 Ablation Studies 


We compared VIBO to simpler variants that either do not
amortize the posterior or do so with independent distributions
of ability and item parameters. These correspond to different
variational families $\mathcal{Q}$ from which to choose $q$:


• VIBO (Independent): We consider the decomposition
$q(a_i, d_{1:M} \mid r_{i,1:M}) = q(a_i \mid r_{i,1:M})\, q(d_{1:M})$, which treats
ability and item characteristics as independent.


• VIBO (Unamortized): We consider $q(a_i, d_{1:M} \mid r_{i,1:M}) =
q_{\psi(r_{i,1:M})}(a_i)\, q(d_{1:M})$, which learns separate posteriors
for each $a_i$, without parameter sharing. Recall that the
subscript $\psi(r_{i,1:M})$ indicates a separate variational
posterior for each unique set of responses.


If we compare unamortized to amortized VIBO in Fig. 2
(top), we see an important efficiency difference. The number
of parameters for the unamortized version scales with the
number of people; the speed shows a corresponding impact,
with the amortized version becoming an order of magnitude
faster than the unamortized one. In general, amortized
inference is much cheaper, especially in circumstances in
which the number of possible response vectors $r_{1:M}$ is very
large (e.g. $2^{95}$ for CritLangAcq). Table 2 compares the
wall clock time (sec.) of amortized VIBO and its unamortized
equivalent for the 5 real world datasets. While
VIBO is comparable to MLE and EM (Fig. 3a), unamortized
VIBO is 2 to 15 times more expensive.
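The difference in parameter scaling can be made concrete with a small count. The sketch assumes a diagonal Gaussian posterior over ability and, for the amortized case, a hypothetical one-hidden-layer encoder of width 64; the actual encoder architecture is not given in this section.

```python
def unamortized_params(n_people: int, dim: int = 1) -> int:
    """One Gaussian mean and log-variance per person and ability dimension."""
    return 2 * dim * n_people

def amortized_params(n_items: int, dim: int = 1, hidden: int = 64) -> int:
    """Weights and biases of an assumed one-hidden-layer encoder
    mapping a response vector to (mean, log-variance)."""
    return (n_items * hidden + hidden) + (hidden * 2 * dim + 2 * dim)

for n_people in (100, 10_000, 1_500_000):
    print(n_people, unamortized_params(n_people), amortized_params(n_items=100))
```

The unamortized count grows linearly with the number of people, while the amortized count depends only on the number of items and the encoder size.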


Exploring accuracy in Fig. 2 (bottom), we see that the
unamortized variant is significantly less accurate at predicting
missing data. This can be attributed to overfitting to
observed responses. With 100 items, there are $2^{100}$ possible
responses from every person, meaning that even large
datasets only cover a small portion of the full set. With
amortization, overfitting is more difficult as the deterministic
mapping $f_\phi$ is not hardcoded to a single response vector.
Without amortization, since we learn a variational posterior
for every observed response vector, we may not generalize
to new response vectors. Unamortized VIBO is thus much
more sensitive to missing data as it does not get to observe
the entire response. We can see evidence of this as unamortized
VIBO is superior to amortized VIBO at parameter
recovery, Fig. 2 (middle), where no data is hidden from the
model; compare this to missing data imputation, where
unamortized VIBO appears inferior: because ability estimates
do not share parameters, those with missing data are less
constrained, yielding poorer predictive performance.
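To see how small the covered portion is, the count of binary response patterns for 100 items can be compared with the largest synthetic setting of 1.5 million people. This is plain arithmetic; the variable names are ours.

```python
n_items = 100
n_people = 1_500_000                      # largest synthetic setting
possible_vectors = 2 ** n_items           # every binary response pattern
# Upper bound: even if every person gave a distinct response vector.
max_coverage = n_people / possible_vectors
print(f"{possible_vectors:.3e} patterns, coverage <= {max_coverage:.1e}")
```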


Finally, when there are very few people (100), unamortized
VIBO and HMC are better at recovering parameters (Fig. 2
middle) than amortized VIBO. This can be explained by
amortization: to train an effective regressor $f_\phi$ requires a
minimum amount of data. With too few responses, the
amortization gap will be very large, leading to poor inference.
Under scarce data we would thus recommend using HMC,
which is fast enough and most accurate.


Table 2: Time Costs with and without Amortization


DATASET       AMORTIZED (SEC.)   UN-AMORTIZED (SEC.)
CritLangAcq   2.8K               43.2K
WordBank      176.4              657.1
DuoLingo      429.9              717.9
Gradescope    114.5              511.1
PISA          25.2K              125.8K
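The per-dataset slowdown from dropping amortization follows directly from Table 2. The snippet below only divides the reported times; the dictionary layout is ours.

```python
# Wall-clock seconds from Table 2: (amortized, unamortized).
table2 = {
    "CritLangAcq": (2.8e3, 43.2e3),
    "WordBank": (176.4, 657.1),
    "DuoLingo": (429.9, 717.9),
    "Gradescope": (114.5, 511.1),
    "PISA": (25.2e3, 125.8e3),
}
slowdown = {name: unamortized / amortized
            for name, (amortized, unamortized) in table2.items()}
for name, ratio in slowdown.items():
    print(f"{name:12s} {ratio:5.1f}x")
```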


The above suggests that amortization is important when
dealing with moderate to large datasets. Turning to the
structure of the amortized posteriors, we note that the
factorization we chose in Thm. 3.1 is only one of many.
Specifically, we could make the simpler assumption of independence
between ability and item characteristics given responses in
our variational posteriors: VIBO (Independent). Such a
factorization would be simpler and faster due to less
gradient computation. However, in our synthetic experiments
(in which we know the true ability and item features), we
found the independence assumption to produce very poor
results: recovered ability and item characteristics had less
than 0.1 correlation with the true parameters. Meanwhile,
the factorization we posed in Thm. 3.1 consistently
produced above 0.9 correlation. Thus, the insight to decompose
$q(a_i, d_{1:M} \mid r_{i,1:M}) = q(a_i \mid d_{1:M}, r_{i,1:M})\, q(d_{1:M} \mid r_{i,1:M})$
instead of assuming independence is a critical one. (This
point is also supported theoretically by research on faithful
inversions of graphical models [43].)


6. DEEP ITEM RESPONSE THEORY 

We have found VIBO to be fast and accurate for inference 
in 2PL IRT, matching HMC in accuracy and EM in speed. 
This classic IRT model is a surprisingly good model for 
item responses despite its simplicity. Yet it makes strong 
assumptions about the interaction of factors, which may 
not capture the nuances of human cognition. With the 
advent of much larger data sets we have the opportunity to 
explore corrections to classic IRT models, by introducing 
more flexible non-linearities. As described above, a virtue 
of VI is the possibility of learning aspects of the generative 
model by optimizing the inference objective. We next explore 
several ways to incorporate learnable non-linearities in IRT, 
using the modern machinery of deep learning. 


6.1 Nonlinear Generalizations of IRT 

We have assumed thus far that $p(r_{i,1:M} \mid a_i, d_{1:M})$ is a fixed
IRT model defining the probability of correct response to
each item. We now consider three different alternatives with
varying levels of expressivity that help define a class of more
powerful nonlinear IRT models.


Learning a Linking Function. We replace the logistic
function in standard IRT with a nonlinear linking function.
As such, it preserves the linear relationships between items
and people. We call this VIBO (Link). For person $i$ and
item $j$, the 2PL-Link generative model is:


$$p(r_{ij} \mid a_i, d_j) = f_\theta(-a_i k_j - d_j)$$ (16)


where $f_\theta$ is a one-dimensional nonlinear function followed by a
sigmoid to constrain the output to be within [0, 1]. In practice,
we parameterize $f_\theta$ as a multilayer perceptron (MLP) with
three layers of 64 hidden nodes with ELU nonlinearities.
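A forward pass of such a link function can be sketched in plain Python. This is an illustration under stated assumptions, not the released implementation: "three layers of 64 hidden nodes" is read here as layer sizes 1, 64, 64, 1, the weights are random rather than trained, and all function names are hypothetical.

```python
import math
import random

def elu(x):
    return x if x > 0 else math.exp(x) - 1.0

def init_mlp(sizes, rng, scale=0.1):
    """Create a list of (weights, biases), one pair per linear layer."""
    return [([[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def mlp_forward(layers, x):
    """Apply linear layers with ELU between them (none after the last)."""
    for idx, (w, b) in enumerate(layers):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
        if idx < len(layers) - 1:
            x = [elu(v) for v in x]
    return x

def link_probability(layers, a, k, d):
    """Equation (16): f_theta = sigmoid(MLP(.)) applied to the scalar -a_i k_j - d_j."""
    z = mlp_forward(layers, [-a * k - d])[0]
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(0)
theta = init_mlp([1, 64, 64, 1], rng)   # assumed reading of the architecture
p = link_probability(theta, a=0.5, k=1.2, d=-0.3)
```

Because the network only sees the scalar $-a_i k_j - d_j$, the learned function can be plotted as a one-dimensional curve.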


Learning a Neural Network. Here, we no longer preserve
the linear relationships between items and people and
instead feed the ability and item characteristics directly into
a neural network, which will combine the inputs nonlinearly.
We call this version VIBO (Deep). For person $i$ and item $j$,
the Deep generative model is:


$$p(r_{ij} \mid a_i, d_j) = f_\theta(a_i, d_j)$$ (17)


where again $f_\theta$ includes a sigmoid function at the end to
constrain the output to [0, 1]. This is an even
more expressive model than VIBO (Link). In practice, we
parameterize $f_\theta$ as three MLPs, each with 3 layers of 64
nodes and ELU nonlinearities. The first MLP maps ability
to a real vector; the second maps item characteristics to a
real vector. These two hidden vectors are concatenated and
given to the final MLP, which predicts the response.
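The three-network structure can be sketched as follows. As before, this is a minimal illustration with random weights; the layer sizes, the one-dimensional ability, the two-parameter item vector $(k_j, d_j)$, and the function names are assumptions.

```python
import math
import random

def elu(x):
    return x if x > 0 else math.exp(x) - 1.0

def init_mlp(sizes, rng, scale=0.1):
    """Create a list of (weights, biases), one pair per linear layer."""
    return [([[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def mlp_forward(layers, x):
    """Apply linear layers with ELU between them (none after the last)."""
    for idx, (w, b) in enumerate(layers):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
        if idx < len(layers) - 1:
            x = [elu(v) for v in x]
    return x

def deep_probability(nets, ability, item):
    """Embed ability and item separately, concatenate, then predict the response."""
    ability_net, item_net, head = nets
    h = mlp_forward(ability_net, ability) + mlp_forward(item_net, item)  # list concat
    z = mlp_forward(head, h)[0]
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(0)
hidden = 64
nets = (init_mlp([1, hidden, hidden, hidden], rng),       # ability a_i (1-D here)
        init_mlp([2, hidden, hidden, hidden], rng),       # item (k_j, d_j)
        init_mlp([2 * hidden, hidden, hidden, 1], rng))   # concatenated -> logit
p = deep_probability(nets, ability=[0.5], item=[1.2, -0.3])
```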


Learning a Residual Correction. Although clearly a
powerful model, we might fear that VIBO (Deep) becomes
too uninterpretable. So, for the third and final nonlinear
model, we use the standard IRT but add a nonlinear residual
component that can correct for any inaccuracies. We call
this version VIBO (Residual). For person $i$ and item $j$, the
2PL-Residual generative model is:


$$p(r_{ij} \mid a_i, k_j, d_j) = \frac{1}{1 + e^{-a_i k_j - d_j + f_\theta(a_i, k_j, d_j)}}$$ (18)


During optimization, we initialize the weights of the residual
network to 0, thus ensuring its initial output is 0. This
encourages the model to stay close to IRT, using the residual
only when necessary. We use the same architectures for the
residual component as in VIBO (Deep).
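The effect of the zero initialization can be checked in a few lines: with all weights and biases at 0, the residual term vanishes and the model reduces to 2PL. For brevity the sketch uses a single small MLP for the residual rather than the three-network architecture of VIBO (Deep); that simplification and the function names are ours.

```python
import math

def elu(x):
    return x if x > 0 else math.exp(x) - 1.0

def zero_mlp(sizes):
    """All-zero weights and biases, so the network outputs 0 for every input."""
    return [([[0.0] * n_in for _ in range(n_out)], [0.0] * n_out)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def mlp_forward(layers, x):
    """Apply linear layers with ELU between them (none after the last)."""
    for idx, (w, b) in enumerate(layers):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
        if idx < len(layers) - 1:
            x = [elu(v) for v in x]
    return x

def irt_2pl(a, k, d):
    """Standard 2PL response probability."""
    return 1.0 / (1.0 + math.exp(-a * k - d))

def residual_probability(layers, a, k, d):
    """Equation (18): the 2PL exponent plus a learned correction f_theta(a, k, d)."""
    correction = mlp_forward(layers, [a, k, d])[0]
    return 1.0 / (1.0 + math.exp(-a * k - d + correction))

theta = zero_mlp([3, 64, 64, 1])
p0 = residual_probability(theta, a=0.5, k=1.2, d=-0.3)
p_irt = irt_2pl(a=0.5, k=1.2, d=-0.3)   # identical to p0 at initialization
```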


6.2 Nonlinear IRT Evaluation 

A generative model explains the data better when it
assigns observations higher probability. We thus evaluate
generative models by estimating the log marginal likelihood
$\log p(r_{1:N,1:M})$ of the training dataset. A higher number
(closer to 0) is better. For a single person, the log marginal
likelihood of his or her $M$ responses can be computed as:


$$\log p(r_{i,1:M}) \approx \log \mathbb{E}_{q_\phi(a_i, d_{1:M} \mid r_{i,1:M})}\left[ \frac{p_\theta(r_{i,1:M}, a_i, d_{1:M})}{q_\phi(a_i, d_{1:M} \mid r_{i,1:M})} \right]$$ (19)


We use 1000 samples to estimate Eq. 19. We also measure
accuracy on missing data imputation as we did in Sec. 5. A
more powerful generative model, that is more descriptive of
the data, should be better at filling in missing values.
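A sample-based estimate of this quantity can be sketched as follows, using a log-sum-exp for numerical stability. To keep the example short, item parameters are treated as known so that only ability is integrated out, the prior on ability is assumed to be standard Normal, and $q$ is a fixed Gaussian rather than a learned posterior; the function names are hypothetical.

```python
import math
import random

def log_normal_pdf(x, mean, std):
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2

def log_likelihood(responses, a, items):
    """Bernoulli 2PL log-likelihood of one person's responses given ability a."""
    total = 0.0
    for r, (k, d) in zip(responses, items):
        p = 1.0 / (1.0 + math.exp(-k * a - d))
        total += math.log(p) if r == 1 else math.log(1.0 - p)
    return total

def log_marginal_estimate(responses, items, q_mean, q_std, n_samples, rng):
    """log (1/S) sum_s p(r, a_s) / q(a_s), with a_s ~ q, via log-sum-exp."""
    log_w = []
    for _ in range(n_samples):
        a = rng.gauss(q_mean, q_std)
        # Assumed prior on ability: standard Normal.
        log_joint = log_likelihood(responses, a, items) + log_normal_pdf(a, 0.0, 1.0)
        log_w.append(log_joint - log_normal_pdf(a, q_mean, q_std))
    m = max(log_w)
    return m + math.log(sum(math.exp(v - m) for v in log_w) / n_samples)

rng = random.Random(0)
items = [(1.0, 0.0), (0.8, -0.5), (1.5, 0.3)]   # known (k_j, d_j), for simplicity
estimate = log_marginal_estimate([1, 0, 1], items, q_mean=0.3, q_std=0.8,
                                 n_samples=1000, rng=rng)
```

As a sanity check, an item with $k_j = 0$ and $d_j = 0$ is answered correctly with probability 0.5 regardless of ability, so with $q$ equal to the prior the estimate is exactly $\log 0.5$.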


6.3 Nonlinear IRT Results 

The top half of Table 3 compares the log likelihoods of
observed data whereas the bottom half of Table 3 compares
the accuracy of imputing missing data. We include VIBO
inference with classical IRT-1PL and IRT-2PL generative 
models as baselines. We find a consistent trend: the more 
powerful generative models achieve a higher log likelihood 
(closer to 0) and a higher accuracy. In particular, we find very 
large increases in log likelihood moving from IRT to Link, 
spanning 100 to 500 log points depending on the dataset. 
Further, from Link to Deep and Residual, we find another 
increase of 100 to 200 log points. In some cases, we find 
Residual to outperform Deep, though the two are equally 
parameterized, suggesting that initialization with IRT can 
find better local optima. These gains in log likelihood trans- 
late to a consistent 1 to 2% increase in held-out accuracy 
for Link/Deep/Residual over IRT. This suggests that the 
datasets are large enough to use the added model flexibility 
appropriately, rather than overfitting to the data. 


We also compare our deep generative IRT models with the
purely deep learning approach called Deep-IRT (see
Sec. 8), which does not model posterior uncertainty. Unlike
traditional IRT models, Deep-IRT was built for knowledge
tracing and assumes sequential responses. To make our
datasets amenable to Deep-IRT, we assume an ordering of
responses from $j = 1$ to $j = M$. As shown in Table 3, our
models outperform Deep-IRT in all 5 datasets by as much
as 30% in missing data imputation (e.g. WordBank).


6.4 Interpreting the Linking Function 

With nonlinear models, we face an unfortunate tradeoff
between interpretability and expressivity. In domains like
education, practitioners greatly value the interpretability of
IRT where predictions can be directly attributed to ability
or item features. With VIBO (Deep), our most expressive
model, predictions use a neural network, making it hard to
understand the interactions between people and items.


Figure 4: Learned link functions for (a) CritLangAcq, (b) 
WordBank, (c) DuoLingo, (d) Gradescope, and (e) PISA. 
The dotted black line shows the default logistic function. 


Fortunately, with VIBO (Link), we can maintain a degree
of interpretability along with power. The “Link” generative
model is identical to IRT, differing only in the linking
function (i.e. item response function). Each subfigure in Fig. 4
shows the learned response function for one of the real world
datasets; the dotted black line represents the best standard
linking function, a sigmoid. We find three classes of linking
functions: (1) For Gradescope and PISA, the learned function
stays near a sigmoid. (2) For WordBank and CritLangAcq,
the response function closely resembles an unfolding model,
which encodes a more nuanced interaction between
ability and item characteristics: higher scores are related to
higher ability only if the ability and item characteristics are
“nearby” in latent space. (3) For DuoLingo, we find a piecewise
function that resembles a sigmoid for positive values
and a negative linear function for negative values. In cases
(2) and (3) we find much greater differences in log likelihood
between VIBO (IRT) and VIBO (Link); see Table 3. For
DuoLingo, VIBO (Link) matches the log density of more
expressive models, suggesting that most of the benefit of
nonlinearity is exactly in this unusual linking function.


7. POLYTOMOUS RESPONSES 


Thus far, we have been working only with response data
collapsed into binary correct/incorrect responses. However,
many questionnaires and examinations are not binary:
responses can be multiple choice (e.g. Likert scale) or even real
valued (e.g. 92% on a course test). Having posed IRT as a
generative model, we have prescribed a Bernoulli distribution
over the i-th person’s response to the j-th item. Yet nothing
prevents us from choosing a different distribution, such as
Categorical for multiple choice or Normal for real values.
The DuoLingo dataset contains partial credit, computed as
the fraction of times an individual gets a word correct. A more
granular treatment of these polytomous values should yield
a more faithful model that can better capture the differences
between people. We thus modeled the DuoLingo data using
a (truncated) Normal distribution with fixed variance for
$p(r_{i,1:M} \mid a_i, d_{1:M})$. Table 4 shows the log densities:
we again observe large improvements from nonlinear models.
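A truncated Normal likelihood for partial-credit responses in [0, 1] can be sketched as follows. The standard deviation of 0.1 and the choice of the 2PL success probability as the mean are assumptions for illustration only; the text specifies only a truncated Normal with fixed variance.

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def truncated_normal_logpdf(r, mean, std, low=0.0, high=1.0):
    """Log-density of a Normal(mean, std) restricted to [low, high]."""
    if not low <= r <= high:
        return float("-inf")
    z = (r - mean) / std
    log_phi = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi) - math.log(std)
    # Renormalize by the probability mass the Normal places on [low, high].
    mass = normal_cdf((high - mean) / std) - normal_cdf((low - mean) / std)
    return log_phi - math.log(mass)

def polytomous_loglik(r, a, k, d, std=0.1):
    """Partial-credit response r in [0, 1]; assumed mean is the 2PL probability."""
    mean = 1.0 / (1.0 + math.exp(-k * a - d))
    return truncated_normal_logpdf(r, mean, std)

value = polytomous_loglik(r=0.75, a=0.5, k=1.2, d=-0.3)
```

Swapping this log-likelihood for the Bernoulli term leaves the rest of the inference procedure unchanged, which is the point made above.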


Item Response Theory can in this way be extended to work 
of all kinds (imagine students writing text, drawing pictures, 
or even coding), encouraging educators to assign open-ended 
work without having to give up proper tools of assessment. 


8. RELATED WORK 
Table 3: Log Likelihoods and Missing Data Imputation for Deep Generative IRT Models


DATASET       DEEP-IRT   VIBO (IRT-1PL)    VIBO (IRT-2PL)    VIBO (LINK-2PL)   VIBO (DEEP-2PL)   VIBO (RES.-2PL)

Log likelihood
CritLangAcq   -          -11249.8 ± 7.6    -10224.0 ± 7.1    -9590.3 ± 2.1     -9311.2 ± 5.1     -9254.1 ± 4.8
WordBank      -          -17047.2 ± 4.3    -5882.5 ± 0.8     -5268.0 ± 7.0     -4658.4 ± 3.9     -4681.4 ± 2.2
DuoLingo      -          -2833.3 ± 0.7     -2488.3 ± 1.4     -1833.9 ± 0.3     -1834.2 ± 1.3     -1745.4 ± 4.7
Gradescope    -          -1090.7 ± 2.9     -876.7 ± 3.5      -750.8 ± 0.1      -705.1 ± 0.5      -715.3 ± 2.7
PISA          -          -13104.2 ± 5.1    -6169.5 ± 4.8     -6120.1 ± 1.3     -6030.2 ± 3.3     -5807.3 ± 4.2

Missing data imputation accuracy
CritLangAcq   0.934      0.927             0.932             0.945             0.948             0.947
WordBank      0.681      0.876             0.880             0.888             0.889             0.889
DuoLingo      0.884      0.880             0.886             0.891             0.897             0.894
Gradescope    0.813      0.820             0.826             0.840             0.847             0.848
PISA          0.524      0.723             0.728             0.718             0.744             0.739


Table 4: DuoLingo with Polytomous Responses


INF. ALG.      TRAIN        TEST
VIBO (IRT)     -22038.07    -21582.03
VIBO (LINK)    -17293.35    -16588.06
VIBO (DEEP)    -15349.84    -14972.66
VIBO (RES.)    -15350.66    -14996.27


We described above a variety of methods for parameter
estimation in IRT such as MLE, EM, and MCMC. The benefits
and drawbacks of these methods are well-documented, so
we need not discuss them here. Instead, we focus specifically
on methods that utilize deep neural networks or variational 
inference to estimate IRT parameters. 


While variational inference has been suggested as a promising 
alternative to other inference approaches for IRT [26], there 
has been surprisingly little work in this area. In an explo- 
ration of Bayesian prior choice for IRT estimation, Natesan 
et al. posed a variational approximation to the posterior: 


$$p(a_i, d_j \mid r_{i,j}) \approx q_\phi(a_i, d_j) = q_\phi(a_i)\, q_\phi(d_j)$$ (20)


This is an unamortized and independent posterior family,
unlike VIBO. As we noted in Sec. 5.4, both amortization and
dependence of ability on items were crucial for our results.


We are aware of two approaches that incorporate deep neural
networks into Item Response Theory: Deep-IRT and
DIRT [3]. Deep-IRT is a modification of the Dynamic
Key-Value Memory Network (DKVMN) that treats data as
longitudinal, processing items one-at-a-time using a recur- 
rent architecture. Deep-IRT produces point estimates of 
ability and item difficulty at each time step, which are then 
passed into a 1PL IRT function to produce the probability of 
answering the item correctly. The main difference between 
DIRT and Deep-IRT is the choice of neural network: instead 
of the DKVMN, DIRT uses an LSTM with attention [47]. In 
our experiments, we compare our approach to Deep-IRT and 
find that we outperform it by up to 30% on the accuracy 
of missing response imputation. On the other hand, our 
models do not capture the longitudinal aspect of response 
data. Combining the two approaches would be natural. 


Lastly, Curi et al. [9] used a VAE to estimate IRT parameters
in a 28-question synthetic dataset. However, this approach
modeled ability as the only unknown variable, ignoring items.
Our analogue to the VAE builds on the IRT graphical model,
incorporating both ability and item characteristics in a
principled manner. This could explain why Curi et al. report
the VAE requiring substantially more data to recover the
true parameters when compared to MCMC, whereas we find
comparable data-efficiency between VIBO and MCMC.


9. BROADER IMPACT 


We briefly emphasize the broader impact of efficient IRT in
the context of education. First, one of the many difficulties
of accurately estimating student ability is cost: attempting
to use MCMC at the scale required by large
entities like MOOCs, local and national governments, and
international organizations is prohibitively expensive. With VIBO,
however, doing so is already possible, as shown by the PISA results.
Second, efficient IRT is an important and necessary step to
encourage the development of more complex models of
student cognition and response. At a minimum, it will enable
faster research and iterative testing on real world data.


10. CONCLUSION 


Item Response Theory is a paradigm for reasoning about
the scoring of tests, surveys, and similar measurement
instruments. Notably, the theory plays an important role in
education, medicine, and psychology. Inferring ability and 
item characteristics poses a technical challenge: balancing 
efficiency against accuracy. In this paper we have found that 
variational inference provides a potential solution, running 
orders of magnitude faster than MCMC algorithms while 
matching their state-of-the-art accuracy. 


Many directions for future work suggest themselves. First, 
further gains in speed and accuracy could be found by explor- 
ing more or less complex families of posterior approximation. 
Second, more work is needed to understand deep generative 
IRT models and determine the most appropriate tradeoff 
between expressivity and interpretability. For instance, we 
found significant improvements from a learned linking func- 
tion, yet in some applications monotonicity may be judged 
important to maintain — greater ability, for instance, should 
correspond to greater chance of success. Finally, VIBO 
should enable more coherent, fully Bayesian, exploration of 
very large and important datasets, such as PISA [13]. 


Recent advances within AI combined with new massive 
datasets have enabled advances in many domains. We have 
given an example of this fruitful interaction for understanding 
humans based on their answers to questions.